1 Introduction

This is a modeling exercise similar to a Kaggle competition. The goal is to build a predictive model.

1.1 Objectives

  • Build a model that predicts the Amount variable. The model will be evaluated on the mean absolute error (MAE) between the predicted and actual targets on the holdout dataset.
  • Determine what variables are most important for making your model predictions.
  • Prepare a .csv or .txt file containing your prediction for each ID in the same format as the sample provided in sample.txt.

I asked myself: what is the significance of using MAE instead of RMSE? I looked this up on Google; when I do use a Google source, I try to keep track of the link.

MAE is the mean absolute difference between the predictions, \(\hat{y}_i\), and the targets, \(y_i\). This means that positive errors are penalized in the same way as negative errors.

\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}{|y_i - \hat{y}_i|}\]

Root mean squared error also treats positive and negative errors equally, but because it squares each error it imposes a harsher penalty on outliers, i.e. observations with large errors.

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{y}_i)^2}}\]

The takeaway is that outliers will be less of an issue in this analysis than if RMSE were used.
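To see the difference concretely, here is a small sketch (toy numbers, not from this dataset) comparing the two metrics with and without one large outlier:

```r
errors_clean   <- c(1, -1, 2, -2, 1)   # residuals with no outlier
errors_outlier <- c(1, -1, 2, -2, 20)  # same residuals plus one big outlier

mae  <- function(e) mean(abs(e))
rmse <- function(e) sqrt(mean(e^2))

mae(errors_clean)    # 1.4
rmse(errors_clean)   # ~1.48
mae(errors_outlier)  # 5.2  (grows linearly in the outlier)
rmse(errors_outlier) # ~9.06 (grows much faster, since the errors are squared)
```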

1.2 Data

The data is straightforward. There are two data files. The train file has the target, Amount, which I am trying to predict. The test file does not have this label and will be used for final evaluation.

  • train.txt - training dataset
  • test.txt - holdout dataset
  • sample.txt - sample of the format for evaluation
train_raw <- read_csv("train.txt")
test_raw <- read_csv("test.txt")
sample_submission <- read_csv("sample.txt")

There is a reasonably large sample size, with 205,062 observations in the training set and 51,264 in the holdout. This helps because most models improve with more data: a larger n generally reduces the variance of the estimates.

There are 33 anonymized numeric features, V1 through V33, plus an ID column. We also see that there are no missing values, which makes life much easier!

head(train_raw , 10) 
## # A tibble: 10 x 35
##        ID     V1      V2     V3     V4     V5     V6     V7      V8     V9
##     <int>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
##  1  22144 -2.08  -1.63    1.25  -2.17   0.292 -0.956  0.811  0.167   0.756
##  2 251626  2.07   0.0784 -2.13   0.192  0.668 -0.622  0.140 -0.167   0.453
##  3 156283  1.96  -0.185  -0.286  0.330 -0.302 -0.139 -0.601 -0.0294  2.27 
##  4  56925 -2.04  -6.35   -0.646  0.920 -3.18   0.946  0.928 -0.0913 -0.277
##  5 271660 -0.826  0.729   1.18  -0.292 -0.563 -0.228  0.342  0.158   0.349
##  6 270117 -0.598  0.993   2.06   0.470  0.307  0.533  0.351  0.216  -0.683
##  7   4429  1.41  -0.668  -0.910 -1.62   1.46   3.27  -1.18   0.712   0.361
##  8  43651 -2.12   0.635   1.56   1.43  -1.39  -0.205 -0.403  0.931  -0.471
##  9  75283  0.885 -1.11    0.602  0.502 -1.03   0.633 -0.691  0.226  -0.691
## 10  45786  0.999 -0.180   1.18   1.07  -0.530  0.953 -0.748  0.477   0.467
## # ... with 25 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## #   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
## #   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
## #   V32 <dbl>, V33 <dbl>, Amount <dbl>

2 Exploratory Analysis

For all data manipulations, I combine the two data sets to ensure both receive identical treatment.

combined <- train_raw %>% 
  mutate(source = "train") %>% 
  rbind(test_raw %>% mutate(Amount = "None", source = "test")) %>% 
  mutate(Amount = as.numeric(Amount))

The first item I look at is the distribution of the target, Amount. It is highly right-skewed, so I take a log transform. This helps to normalize the shape, which is useful for many modeling applications. We also see that there are significant spikes. I also looked at a Box-Cox transform, which generalizes the log transform: Box-Cox with lambda equal to 0 is exactly the log. When I tested this, the estimated lambda was fairly close to 0, so I just used the log, for simplicity and for consistency between the different data sets.
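The Box-Cox check described above can be sketched like this (assuming the MASS package is available; the intercept-only formula profiles lambda for the response on its own):

```r
library(MASS)

# Profile the Box-Cox log-likelihood for Amount (shifted by 1 so it is
# strictly positive); a maximizing lambda near 0 justifies the plain log.
bc <- boxcox(Amount + 1 ~ 1, data = train_raw,
             lambda = seq(-0.5, 0.5, by = 0.01), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
```

With lambda that close to 0, the plain log is effectively equivalent and is simpler to apply consistently to both files.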

This distribution has a very long tail.

train_raw %>% 
  sample_frac(0.2) %>% 
  ggplot(aes(Amount)) + 
  geom_histogram() + 
  ggtitle("Distribution of Amount") + 
  xlim(0, 100000)

A key assumption of many linear models is that the errors, and hence roughly the response, are normally distributed. Taking the log transform brings this heavily skewed target much closer to a normal distribution.

combined <- combined %>% mutate(target = log(Amount + 1))

combined %>% 
  sample_frac(0.6) %>%
  ggplot(aes(sample = target)) + 
  stat_qq() + 
  stat_qq_line() + 
  ggtitle("Empirical Normal Quantiles vs Theoretical Quantiles after Power Transform")

This is better but still not perfect. I notice there is a spike at Amount = 100.100, along with a few other point masses. This is very common in insurance data, where deductibles create point masses in claim amounts.

combined %>% 
  group_by(Amount) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n))
## # A tibble: 31,117 x 2
##    Amount     n
##     <dbl> <int>
##  1   NA   28480
##  2  100.  12302
##  3  198.   5407
##  4   89.1  4364
##  5 1000.   4272
##  6 1502.   2912
##  7   76.1  2689
##  8 1001    2655
##  9  129.   2608
## 10  179.   2335
## # ... with 31,107 more rows

I create a new categorical variable that captures these point masses.

combined <- combined %>% 
  mutate(spike = as.factor(coalesce(case_when(
           Amount %in% c(100.100, 198.198, 89.089, 999.999) ~ as.character(Amount),
           Amount < 100.100 ~ "zero"
         ), "none")))

After removing these, the distribution looks much closer to normal.

combined %>% 
  sample_frac(0.8) %>% 
  filter(spike == "none") %>% 
  ggplot(aes(target)) + 
  geom_histogram() + 
  ggtitle("Distribution of Target After Removing 'Deductibles'")

As an extra precaution, I compared the train and test set distributions to check whether they are drawn from the same distribution. My strategy was to compare the medians and the 1st and 3rd quartiles.

They appear to be essentially identical.

first_quantile <- function(x){quantile(x, 0.25)}
third_quantile <- function(x){quantile(x, 0.75)}

combined %>% 
  group_by(source) %>% 
  select(-Amount, -ID, -target, -spike) %>% 
  summarise_all(funs(first_quantile, median, third_quantile
    )) %>% 
  gather(feature, stat, -source) %>% 
  spread(source, stat) %>% 
  mutate(percent_difference = abs((test - train)/train)) %>%
  arrange(desc(percent_difference)) %>% 
  datatable()

As seen in these graphs, many of the features have long positive and negative tails. They are all centred at zero, and most are roughly symmetric. This saves a lot of time: otherwise they would need to be transformed into this shape before use in linear models.

combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  sample_frac(0.2) %>% 
  gather(column, value, 1:10) %>% 
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(vars(column), scales = "free")

combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  sample_frac(0.2) %>% 
  gather(column, value, 11:20) %>% 
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(vars(column), scales = "free")

V30 through V33 appear to be approximately uniform.

combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  sample_frac(0.2) %>% 
  gather(column, value, 21:33) %>% 
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(vars(column), scales = "free")

Coincidentally, we see that the features are already ordered by their standard deviations!

combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  summarise_all(sd) %>% 
  gather(feature, value) %>% 
  arrange(desc(value)) 
## # A tibble: 35 x 2
##    feature value
##    <chr>   <dbl>
##  1 V1       1.96
##  2 V2       1.65
##  3 V3       1.52
##  4 V4       1.42
##  5 V5       1.38
##  6 V6       1.33
##  7 V7       1.24
##  8 V8       1.19
##  9 V9       1.10
## 10 V10      1.09
## # ... with 25 more rows

3 Outlier Analysis

The main difficulty with this data is the length of some of the tails. I looked at univariate statistics to see if there was a pattern. Here I define an outlier as lying below the alpha or above the 1 - alpha sample quantile; with alpha = 0.001, that means below the 0.1st or above the 99.9th percentile.

IQR_outlier <- function(input_column, alpha = 0.001){
  x <- unlist(input_column)
  #return the proportion of values outside the [alpha, 1 - alpha] sample quantiles
  upper_bound <- quantile(x, 1 - alpha)
  lower_bound <- quantile(x, alpha)
  is_outlier <- (x < lower_bound | x > upper_bound)
  percent_outlier <- mean(is_outlier, na.rm = TRUE)
  return(percent_outlier)
}

Running this over all columns, we see that every column flags roughly the same proportion of outliers. This is unsurprising, since the definition is quantile-based.

train_raw %>% 
  select(-ID) %>% 
  map_df(IQR_outlier, alpha = 0.001) %>% 
  gather(column, percent_univariate_outlier) %>% 
  arrange(desc(percent_univariate_outlier)) %>% 
  datatable()
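Because the outlier definition is quantile-based, the flagged fraction is pinned at about 2 * alpha for any continuous column, regardless of its shape. A quick check on simulated (hypothetical) columns, using a compact restatement of IQR_outlier:

```r
quantile_outlier_rate <- function(x, alpha = 0.001) {
  # fraction of observations outside the [alpha, 1 - alpha] sample quantiles
  mean(x < quantile(x, alpha) | x > quantile(x, 1 - alpha))
}

set.seed(42)
# three very differently shaped continuous distributions
rates <- sapply(list(normal  = rnorm(1e5),
                     uniform = runif(1e5),
                     skewed  = rexp(1e5)),
                quantile_outlier_rate, alpha = 0.001)
round(rates, 4)  # all close to 2 * 0.001 = 0.002, whatever the shape
```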

By this point I strongly suspected that the data was simulated. If so, maybe there was a pattern to the outliers. Because they cause difficulty for the linear models later on, I spent some time examining them.

As seen below, some points are extreme in many dimensions at once. Observation number 118164, for example, is above the 99.99th or below the 0.01st percentile in 21 out of 33 dimensions.

outlier_summary <- train_raw %>% 
  select(-ID) %>% 
  map_df(function(input_column, alpha = 0.0001){
  x <- unlist(input_column)
  #flag values outside the [alpha, 1 - alpha] sample quantiles
  upper_bound <- quantile(x, 1 - alpha)
  lower_bound <- quantile(x, alpha)
  is_outlier <- (x < lower_bound | x > upper_bound)
  return(is_outlier)
  }) %>% 
  mutate(obs_number = row_number()) %>% 
  gather(feature, outlier, -obs_number) %>% 
  arrange(desc(obs_number)) %>% 
  group_by(obs_number) %>% 
  summarise(total_outliers = sum(outlier)) %>% 
  arrange(desc(total_outliers)) %>% 
  filter(total_outliers>0)
head(outlier_summary)
## # A tibble: 6 x 2
##   obs_number total_outliers
##        <int>          <int>
## 1     118164             21
## 2     117427             15
## 3      18186             12
## 4     185442             11
## 5     215346             11
## 6     247628             11

I look at these outliers within the data to see if there is a visible pattern. The size of each point represents the number of dimensions in which it falls outside the quantile bounds. The point with ID 274772 is flagged in more than 20 dimensions.

train_raw %>% 
  mutate(obs_number = row_number()) %>% 
  dplyr::slice(outlier_summary$obs_number) %>% 
  left_join(outlier_summary, by = c("obs_number")) %>% 
  mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>% 
  ggplot(aes(V2, V6, size = total_outliers)) + 
  geom_point(color='dodgerblue') + 
  geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) + 
  ggtitle("High Outlying Points Across Multiple Dimensions") 

Again, 274772 shows up, this time in the V7 vs V21 view.

train_raw %>% 
  mutate(obs_number = row_number()) %>% 
  dplyr::slice(outlier_summary$obs_number) %>% 
  left_join(outlier_summary, by = c("obs_number")) %>% 
  mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>% 
  ggplot(aes(V7, V21, size = total_outliers)) + 
  geom_point(color='dodgerblue') + 
  geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) + 
  ggtitle("High Outlying Points Across Multiple Dimensions") 

train_raw %>% 
  mutate(obs_number = row_number()) %>% 
  dplyr::slice(outlier_summary$obs_number) %>% 
  left_join(outlier_summary, by = c("obs_number")) %>% 
  mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>% 
  ggplot(aes(V11, V28, size = total_outliers)) + 
  geom_point(color='dodgerblue') + 
  geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) + 
  ggtitle("Do you really expect to hide, 274772?") 

We can look at the asymmetry of each feature through its skewness. Let’s look at the top 10 most skewed features in more detail.

skewed_features <- combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  summarise_all(skewness) %>% 
  gather(feature, skew) %>% 
  mutate(abs_skew = abs(skew)) %>% 
  arrange(desc(abs_skew)) %>% 
  select(feature) %>% 
  unlist() %>% 
  as.character()

combined %>% 
  select(skewed_features[1:10]) %>% 
  gather(column, value) %>% 
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(vars(column), scales = "free")

train_raw %>% 
  select(skewed_features[1:10]) %>% 
  dplyr::slice(outlier_summary$obs_number) %>% 
  gather(column, value) %>% 
  ggplot(aes(value)) + 
  geom_density() + 
  facet_wrap(vars(column), scales = "free")

4 Correlations

These features appear to be mutually uncorrelated, with the exception of V15 and V29.

correlation <- combined %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  cor()

corrplot(correlation,
         type = "upper")

V15 is a duplicate of V29. After dropping V29, let’s look again at the correlations.

combined %>% 
  sample_frac(0.2) %>% 
  ggplot(aes(V15, V29)) +
  geom_point() + 
  ggtitle("Column 15 is a duplicate of v29")

correlation <- train_raw %>% 
  select_if(is.numeric) %>% 
  mutate(target = log(Amount + 1)) %>% 
  select(-ID, -V29, -Amount) %>% 
  cor()

corrplot(correlation,
         type = "upper")

Which features have the highest correlation with the target Amount?

top_5_corr_with_amount <- tibble(feature = rownames(correlation),
                                 corr_with_target = as.numeric(correlation[, 'target'])) %>% 
  #use rownames so the labels stay aligned after dropping V29
  filter(feature != "target") %>% 
  mutate(abs_corr_with_amount = abs(corr_with_target)) %>% 
  arrange(desc(abs_corr_with_amount)) %>% 
  top_n(5, abs_corr_with_amount) %>% 
  select(feature) %>% 
  unlist() %>% 
  as.character()

train_raw %>% 
  sample_frac(0.1) %>% 
  mutate(target = log(Amount + 1)) %>% 
  select(top_5_corr_with_amount, target) %>% 
  ggpairs()

5 Principal Component Analysis

PCA is a way of reducing the dimension of a data matrix by finding linear combinations of the features (principal components) that explain most of the variance.

As seen below, the variance is spread almost evenly across the components; the first principal component explains only slightly more than the rest.

model_matrix <- train_raw %>% dplyr::select(-Amount)
pca <- prcomp(model_matrix, scale = T, center = T)
plot(pca, type = "l")

The first PC explains about 6% of the total variance, and the remaining components each explain roughly 3%. This again suggests that the data was simulated: the features are almost perfectly independent. If they were perfectly independent, each component would explain about 1/33 of the variance, which comes out to roughly 3.03%. (The final component has essentially zero variance because V15 duplicates V29.)

summary(pca)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.44501 1.25170 1.02203 1.00889 1.00715 1.00669
## Proportion of Variance 0.06141 0.04608 0.03072 0.02994 0.02983 0.02981
## Cumulative Proportion  0.06141 0.10749 0.13822 0.16815 0.19799 0.22779
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.00541 1.00480 1.00350 1.00295 1.00290 1.00235
## Proportion of Variance 0.02973 0.02969 0.02962 0.02959 0.02958 0.02955
## Cumulative Proportion  0.25752 0.28722 0.31684 0.34642 0.37601 0.40556
##                          PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     1.0015 1.00124 1.00111 1.00076 1.00042 1.00009
## Proportion of Variance 0.0295 0.02948 0.02948 0.02946 0.02944 0.02942
## Cumulative Proportion  0.4351 0.46454 0.49402 0.52347 0.55291 0.58233
##                          PC19    PC20    PC21    PC22   PC23    PC24
## Standard deviation     0.9998 0.99933 0.99921 0.99888 0.9981 0.99782
## Proportion of Variance 0.0294 0.02937 0.02937 0.02935 0.0293 0.02928
## Cumulative Proportion  0.6117 0.64110 0.67046 0.69981 0.7291 0.75839
##                           PC25    PC26    PC27    PC28   PC29    PC30
## Standard deviation     0.99669 0.99613 0.99493 0.99370 0.9912 0.98988
## Proportion of Variance 0.02922 0.02918 0.02911 0.02904 0.0289 0.02882
## Cumulative Proportion  0.78761 0.81679 0.84591 0.87495 0.9038 0.93266
##                           PC31    PC32    PC33     PC34
## Standard deviation     0.98746 0.98415 0.58808 8.85e-15
## Proportion of Variance 0.02868 0.02849 0.01017 0.00e+00
## Cumulative Proportion  0.96134 0.98983 1.00000 1.00e+00
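The near-flat variance profile above is exactly what independent features produce. As a sanity check on simulated (hypothetical) data: PCA on p independent standard normal columns gives each component about 1/p of the variance.

```r
set.seed(7)
p <- 10
x <- matrix(rnorm(5e4 * p), ncol = p)  # 10 independent standard normal features

pca_sim <- prcomp(x, scale. = TRUE, center = TRUE)
prop_var <- pca_sim$sdev^2 / sum(pca_sim$sdev^2)
round(prop_var, 3)  # every component explains close to 1/10 of the variance
```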

6 Pairwise Dotplots

train_raw %>% 
  sample_frac(0.2) %>% 
  mutate(target = log(Amount + 1)) %>% 
  select(2:6, target) %>% 
  ggpairs(mapping = aes(fill = target), progress = F)

train_raw %>% 
  sample_frac(0.1) %>% 
  mutate(target = log(Amount + 1)) %>% 
  select(7:12, target) %>% 
  ggpairs(mapping = aes(fill = target), progress = F)

These variables show no visible relationship with the target. Don’t use them.

train_raw %>% 
  sample_frac(0.1) %>% 
  mutate(target = log(Amount + 1)) %>% 
  select(V13, V28, V29, V30, V31, target) %>% 
  ggpairs(mapping = aes(fill = target), progress = F)

7 Data Preprocessing for Modeling

This file will be where I do the model building.

7.1 Training/Test Split

We will split the train data into a smaller training set and a validation set. What is a good split ratio? With less training data, there is more variance in the parameter estimates; with a smaller validation set, there is more variance in the performance estimates. The goal is to divide the data so that there is a good balance between the two.

The process will be

  1. Split train into 80% training and 20% validation
  2. Use cross validation to fit models on the train set by tuning hyperparameters
  3. Use the 20% validation set to evaluate competing models
  4. Retrain the final model on the combined data after fixing hyperparameters
  5. Make final predictions based on test.txt.

Prepare data for training and testing.

#Remember to back-transform predictions with exp(pred) - 1 before submitting!!
df <- combined %>% select( -V29, -ID, -spike)

data_for_training <- df %>% filter(source == "train") %>% select(-source, -Amount)
data_for_predictions <- df %>% filter(source == "test") %>% select(-source, -Amount)
train_index <- createDataPartition(data_for_training$target, p = 0.8, list = FALSE) %>% as.numeric()

train <- data_for_training %>% dplyr::slice(train_index) %>% select(target, everything())
test <- data_for_training %>% dplyr::slice(-train_index) %>% select(target,everything())

#some models need a matrix instead of a data frame
train_x <- train %>% select(-target) %>% as.matrix()
train_y <- train$target
holdout_x <- data_for_predictions %>% dplyr::select(-target) %>% as.matrix()

#some models need a matrix instead of a data frame
test_x <- test %>% select(-target) %>% as.matrix()
test_y <- test$target
    
saveRDS(train_x, "train_x.RDS")
saveRDS(train_y, "train_y.RDS")

This is the data excluding the spikes. I thought about fitting a model for the “deductibles” separately. Using “Is_Spike” as a target, I built a classifier with 88% accuracy but never used it. As you can imagine, this could have been stacked on top of a regression model.

  df <- combined %>% filter(spike == "none") %>% select( -Amount, -V29, -ID)
  data_for_training <- df %>% filter(source == "train") %>% select(-source, -spike)
  data_for_predictions <- df %>% filter(source == "test") %>% select(-source, -spike)
  train_index <- createDataPartition(data_for_training$target, p = 0.8, list = FALSE) %>% as.numeric()
  train_no_spikes <- data_for_training %>% dplyr::slice(train_index) %>% select(target, everything())
  test_no_spikes <- data_for_training %>% dplyr::slice(-train_index) %>% select(target, everything())

8 Modeling

I fit a baseline model before doing any transformations. This allows me to assess whether later models are actually improving and how close they come to the “irreducible” error. I tried not to look at the test error when tuning parameters, to avoid human-in-the-loop overfitting.

eval_model <- function(input_model, input_data = "train") {
  if(input_data == "train"){
    model_prediction <- predict.train(input_model, train_x)
    df <- postResample(pred = model_prediction, obs = train_y)}
  else {
    model_prediction <- predict.train(input_model, test_x)
    df <- postResample(pred = model_prediction, obs = test_y)}
  tibble("Data" = input_data, 'RMSE' = df[1], 'Rsquared' = df[2], 'MAE' = df[3])
  }

8.1 Baseline Linear Model

Fit a baseline model with the features which have the highest correlation with Amount.

  • This performs poorly on predicting high outliers of the target
  • Residuals are approximately normal
  • Residuals increase as predicted value increases
regressControl  <- trainControl(method="repeatedcv",
                    number = 5,
                    repeats = 1, #set this to 1 for now
                    returnResamp = "all"
                    ) 


baseline <- train(target ~ V2 + V7 + V5,
           data = train[-94590,],
           method  = "lm",
           trControl = regressControl)

plot(baseline$finalModel)

eval_model(baseline)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train  1.68    0.259  1.31
summary(baseline)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.157  -0.958   0.251   1.231  11.845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.568570   0.003708 2041.13   <2e-16 ***
## V2          -0.500492   0.002261 -221.37   <2e-16 ***
## V7           0.165245   0.003088   53.51   <2e-16 ***
## V5          -0.399248   0.002744 -145.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.679 on 205057 degrees of freedom
## Multiple R-squared:  0.2592, Adjusted R-squared:  0.2592 
## F-statistic: 2.392e+04 on 3 and 205057 DF,  p-value: < 2.2e-16

Drop the high-leverage outliers from the training data.

train <- train %>% 
  dplyr::slice(-c(14568, 172386, 200929))

8.2 GLM with all predictors

Fitting a model with all predictors first, we see that V13, V28, and V30 through V33 do not appear to be significant.

regressControl  <- trainControl(method="repeatedcv",
                    number = 10,
                    repeats = 3, #set this to 1 for now
                    returnResamp = "all"
                    ) 


GLM_all <- train(target ~ ., #fit with all predictors
           data = train,
           method  = "lm",
           trControl = regressControl)

plot(GLM_all$finalModel)

eval_model(GLM_all)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train  1.58    0.341  1.21

Based on their p-values being greater than 0.05, I drop V13, V28, and V30 through V33 from the model.

regressControl  <- trainControl(method="repeatedcv",
                    number = 10,
                    repeats = 3, #set this to 1 for now
                    returnResamp = "all"
                    ) 


GLM_best <- train(target ~ ., #all predictors minus the insignificant ones
           data = train %>% select(-V28, -V30, -V31, -V33, -V13, -V32) ,
           method  = "lm",
           trControl = regressControl)

plot(GLM_best$finalModel)

eval_model(GLM_best)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train  1.58    0.341  1.21
summary(GLM_best)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.188  -0.890   0.254   1.136  12.857 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  7.569371   0.003497 2164.472  < 2e-16 ***
## V1          -0.086798   0.001791  -48.467  < 2e-16 ***
## V2          -0.501056   0.002133 -234.957  < 2e-16 ***
## V3          -0.039987   0.002320  -17.234  < 2e-16 ***
## V4          -0.030021   0.002474  -12.132  < 2e-16 ***
## V5          -0.399803   0.002591 -154.280  < 2e-16 ***
## V6           0.237545   0.002646   89.786  < 2e-16 ***
## V7           0.164891   0.002918   56.513  < 2e-16 ***
## V8          -0.028754   0.002932   -9.807  < 2e-16 ***
## V9          -0.140543   0.003190  -44.052  < 2e-16 ***
## V10         -0.018265   0.003230   -5.655 1.56e-08 ***
## V11         -0.082116   0.003433  -23.918  < 2e-16 ***
## V12         -0.032781   0.003511   -9.336  < 2e-16 ***
## V14          0.049866   0.003667   13.600  < 2e-16 ***
## V15         -0.112465   0.003824  -29.413  < 2e-16 ***
## V16         -0.204250   0.003998  -51.091  < 2e-16 ***
## V17          0.040542   0.004148    9.773  < 2e-16 ***
## V18          0.095243   0.004178   22.795  < 2e-16 ***
## V19         -0.016562   0.004293   -3.858 0.000114 ***
## V20          0.313050   0.004612   67.879  < 2e-16 ***
## V21          0.203565   0.004790   42.496  < 2e-16 ***
## V22          0.099897   0.004815   20.749  < 2e-16 ***
## V23         -0.068558   0.005577  -12.294  < 2e-16 ***
## V24         -0.040318   0.005775   -6.981 2.94e-12 ***
## V25         -0.018760   0.006704   -2.798 0.005135 ** 
## V26         -0.082478   0.007256  -11.367  < 2e-16 ***
## V27         -0.156191   0.008837  -17.675  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.584 on 205032 degrees of freedom
## Multiple R-squared:  0.3412, Adjusted R-squared:  0.3411 
## F-statistic:  4084 on 26 and 205032 DF,  p-value: < 2.2e-16

8.3 GLM with all predictors excluding spikes

regressControl  <- trainControl(method="repeatedcv",
                    number = 10,
                    repeats = 3, 
                    returnResamp = "all"
                    ) 

GLM_all_no_spike <- train(target ~ ., 
           data = train_no_spikes %>% select(-V28, -V30, -V31, -V33, -V13, -V32),
           method  = "lm",
           trControl = regressControl)

summary(GLM_all_no_spike)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.620  -0.774   0.138   0.897   8.564 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  7.993034   0.002878 2776.913  < 2e-16 ***
## V1          -0.061505   0.001470  -41.827  < 2e-16 ***
## V2          -0.413670   0.001715 -241.180  < 2e-16 ***
## V3          -0.073464   0.001940  -37.875  < 2e-16 ***
## V4           0.024382   0.001998   12.201  < 2e-16 ***
## V5          -0.296226   0.002044 -144.916  < 2e-16 ***
## V6           0.172774   0.002130   81.114  < 2e-16 ***
## V7           0.103964   0.002332   44.586  < 2e-16 ***
## V8          -0.054906   0.002554  -21.500  < 2e-16 ***
## V9          -0.078418   0.002604  -30.109  < 2e-16 ***
## V10         -0.078368   0.002767  -28.327  < 2e-16 ***
## V11         -0.050627   0.002845  -17.794  < 2e-16 ***
## V12          0.002081   0.002910    0.715 0.474648    
## V14          0.011834   0.003212    3.684 0.000229 ***
## V15         -0.070650   0.003102  -22.777  < 2e-16 ***
## V16         -0.165386   0.003214  -51.451  < 2e-16 ***
## V17          0.024038   0.003596    6.685 2.31e-11 ***
## V18          0.103477   0.003380   30.615  < 2e-16 ***
## V19         -0.044662   0.003449  -12.951  < 2e-16 ***
## V20          0.311606   0.003627   85.910  < 2e-16 ***
## V21          0.175724   0.004094   42.919  < 2e-16 ***
## V22          0.121305   0.003951   30.704  < 2e-16 ***
## V23         -0.093028   0.004370  -21.287  < 2e-16 ***
## V24         -0.045230   0.004694   -9.635  < 2e-16 ***
## V25          0.080241   0.005450   14.723  < 2e-16 ***
## V26         -0.084623   0.005805  -14.578  < 2e-16 ***
## V27         -0.209188   0.007070  -29.588  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.197 on 175380 degrees of freedom
## Multiple R-squared:  0.3987, Adjusted R-squared:  0.3986 
## F-statistic:  4473 on 26 and 175380 DF,  p-value: < 2.2e-16
eval_model(GLM_all_no_spike)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train  1.66    0.329  1.24
plot(GLM_all_no_spike$finalModel)

We look at the predictions against the target.

test_no_spikes %>% 
  mutate(predictions = predict.train(GLM_all_no_spike, test_no_spikes)) %>% 
  ggplot(aes(target, predictions)) + 
  geom_point() + 
  ggtitle("Predictions vs Target for GLM excluding spikes")

8.4 XGBoost Tuning Process

To deal with non-linear relationships, we need a more flexible model.

The tuning parameters are

  • nrounds: Number of trees, default: 100
  • max_depth: Maximum tree depth, default: 6
  • eta: Learning rate, default: 0.3
  • gamma: Used for tuning of Regularization, default: 0
  • colsample_bytree: Column sampling, default: 1
  • min_child_weight: Minimum leaf weight, default: 1
  • subsample: Row sampling, default: 1

We’ll break down the tuning of these into five sections:

  • Step 1. Fixing learning rate eta and number of iterations nrounds
  • Step 2. Maximum depth max_depth and child weight min_child_weight
  • Step 3. Setting column colsample_bytree and row sampling subsample
  • Step 4. Experimenting with different gamma values
  • Step 5. Reducing the learning rate eta

I can create a diagram of what a tree looks like using the DiagrammeR package.

dummy_xgb <- xgboost(data = train_x, label = train_y, max.depth = 2,
               eta = 1, nthread = 2, nround = 2,objective = "reg:linear")
## [1]  train-rmse:1.661409 
## [2]  train-rmse:1.589086
xgb.plot.tree(colnames(train_x), model = dummy_xgb)

8.5 XGBoost Tuning

As a baseline for the xgboost models, we’ll first fit using mostly default parameters.

grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 3,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

train_control <- caret::trainControl(
  method = "none",
  verboseIter = TRUE, # print training log
  allowParallel = TRUE # FALSE for reproducible results 
)

#avoid needing to refit model when knitting doc

# xgb_base <- caret::train(
#   x = train_x,
#   y = train_y,
#   trControl = train_control,
#   tuneGrid = grid_default,
#   method = "xgbTree",
#   eval_metric = "mae"
# )

#saveRDS(xgb_base, "Models/xgb_base.RDS")
xgb_base <- readRDS("Models/XGB_base.RDS")
eval_model(xgb_base)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train  1.10    0.686 0.772

We’ll start with the “bigger knobs” and use those settings to find the best values for the “smaller knobs”, then come back and refine the more significant parameters. We start by fixing the number of trees (nrounds), which is selected together with the learning rate: a lower learning rate updates the predictions more slowly, so it needs more rounds to minimize the loss, while too high a learning rate eventually leads to instability. More trees with a lower learning rate is almost always better, but with diminishing returns. To reduce compute time while choosing the other parameters, we set nrounds to 500; once the other parameters have been chosen, we will come back and turn it up.

nrounds <- 500
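The interaction between eta and nrounds can be illustrated with a stylized sketch (a deliberate simplification, not the actual xgboost update): if each boosting round closes an eta-fraction of the remaining gap to the optimum, the number of rounds needed scales roughly with 1/eta.

```r
# Stylized sketch of the eta / nrounds tradeoff: each round shrinks the
# normalized gap to the optimal prediction by a factor of (1 - eta).
boost_rounds_to_converge <- function(eta, tol = 0.01) {
  gap <- 1      # normalized distance from the optimal prediction
  rounds <- 0
  while (gap > tol) {
    gap <- gap * (1 - eta)  # one round closes an eta-fraction of the gap
    rounds <- rounds + 1
  }
  rounds
}

boost_rounds_to_converge(eta = 0.4)  # 10 rounds
boost_rounds_to_converge(eta = 0.1)  # 44 rounds, roughly four times as many
```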

Then we can fill in the other items, using suggestions from [here](https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions/14).

t1 <-  Sys.time()

tune_grid <- expand.grid(
  nrounds = seq(from = 50, to = 500, by = 100),
  eta = c(0.1, 0.2, 0.4),
  max_depth = 3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 3, # with n folds 
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

# xgb_tune <- caret::train(
#   x = train_x,
#   y = train_y,
#   trControl = tune_control,
#   tuneGrid = tune_grid,
#   method = "xgbTree",
#   verbose = TRUE
# )

t2 <- Sys.time()

timediff <- t2 - t1

# saveRDS(xgb_tune, "Models/xgb_tune.RDS")
xgb_tune <- readRDS("Models/XGB_tune.RDS")
plot(xgb_tune)

From the plots above, we see that the best learning rate eta is 0.4, which is a high value to be starting with. GBMs generally perform better with a larger number of trees, but more trees take longer to train. To keep this fast, we’ll keep the number of trees at 500 or fewer while tuning the other parameters.

Next, we move on to finding a good value for the max tree depth. We start with 3 +/- 1. The maximum depth controls the depth, or “height”, of each tree and helps to avoid overfitting. A greater depth can capture interaction effects better, but setting it too high will overfit to the training set. We also try higher values of min_child_weight, which controls the minimum number of observations that are allowed in each terminal node; higher values prevent overfitting.
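For squared-error loss, min_child_weight reduces to a minimum number of observations per leaf. The toy split search below (my own sketch, not the competition code) shows how raising it blocks splits that isolate a handful of extreme points:

```r
# Find the best single split of y on x, rejecting any split that would
# leave fewer than `min_child_weight` observations in either child.
best_split <- function(x, y, min_child_weight = 1) {
  best <- list(sse = Inf, cut = NA)
  for (cut in head(sort(unique(x)), -1)) {
    left <- y[x <= cut]
    right <- y[x > cut]
    if (length(left) < min_child_weight || length(right) < min_child_weight) next
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) best <- list(sse = sse, cut = cut)
  }
  best
}

x <- 1:10
y <- c(rep(0, 9), 100)                       # one extreme observation
best_split(x, y, min_child_weight = 1)$cut   # 9: isolates the outlier
best_split(x, y, min_child_weight = 3)$cut   # 7: the outlier can't sit alone
```

In xgboost itself, min_child_weight is defined as a minimum sum of hessian weights in a child, which for squared-error loss equals the observation count.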

The code below took 30 minutes to fit.

t1 <-  Sys.time()

tune_grid2 <- expand.grid(
  nrounds = c(150, 300, 500),
  eta = xgb_tune$bestTune$eta,
  max_depth = c(2, 3, 4),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 2, 3),
  subsample = 1
)

# xgb_tune2 <- caret::train(
#   x = train_x,
#   y = train_y,
#   trControl = tune_control,
#   tuneGrid = tune_grid2,
#   method = "xgbTree",
#   verbose = TRUE
# )

timediff <-  Sys.time() - t1

xgb_tune2 <- readRDS("Models/XGB_tune2.RDS")

# saveRDS(xgb_tune2, "Models/xgb_tune2.RDS")

eval_model(xgb_tune2)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train 0.784    0.839 0.520
plot(xgb_tune2)

xgb_tune2$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 24     500         4 0.4     0                1                2         1

We see that the best max depth is 4 with a min_child_weight of 2.

Finally, now that the tree parameters are tuned, I go back and tune the final boosting parameters.

Unfortunately, this overfit: although the train MAE is lower, the test MAE is higher. In a perfect world, this could be rerun with a larger number of trees to further improve the performance.

t1 <-  Sys.time()

tune_grid3 <- expand.grid(
  nrounds = c(600, 900, 1200),
  eta = c(0.08, 0.1, 0.3),
  max_depth = 4,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 2,
  subsample = c(1, 0.8)
)

# xgb_tune3 <- caret::train(
#   x = train_x,
#   y = train_y,
#   trControl = tune_control,
#   tuneGrid = tune_grid3,
#   method = "xgbTree",
#   verbose = TRUE
# )

timediff <-  Sys.time() - t1

# saveRDS(xgb_tune3, "Models/xgb_tune3.RDS")
xgb_tune3 <- readRDS("Models/XGB_tune3.RDS")

plot(xgb_tune3)

eval_model(xgb_tune3)
## # A tibble: 1 x 4
##   Data   RMSE Rsquared   MAE
##   <chr> <dbl>    <dbl> <dbl>
## 1 train 0.699    0.872 0.460

8.6 Model Selection

What does the error look like against each incremental tuning step?

xgb_models <- list(xgb_base, xgb_tune, xgb_tune2, xgb_tune3)

xgb_models %>% 
  map_df(eval_model) %>%
  mutate(GBM_tuning_step = 1:4) %>% 
  rbind(
    xgb_models %>% 
    map_df(., .f = eval_model, "test") %>%
    mutate(GBM_tuning_step = 1:4)
    ) %>% datatable()

8.7 Making Predictions

Remember to reverse the log transform and subtract 1 from the predictions before submitting!
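A quick sanity check with toy values (not competition data): the target was modeled on the log(Amount + 1) scale, so exp(pred) - 1 recovers the original units.

```r
# Round-trip the transform used for the target
amount <- c(0, 10, 250)
logged <- log(amount + 1)       # forward transform applied before modeling
recovered <- exp(logged) - 1    # back-transform applied to predictions
all.equal(recovered, amount)    # TRUE
```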

#write the predictions to a tab-separated text file
make_predictions <- function(model, write = FALSE){
  output_path <- paste0("Predictions/", as.character(model$method), " Predictions - ", format(Sys.time(), '%d %B %Y'), ".txt")
  # reverse the log transform on the predictions
  pred <- exp(predict.train(model, holdout_x)) - 1
  
  if(write){
    data_frame(ID = test_raw$ID, Amount = pred) %>% 
      write_tsv(., path = output_path)
  } else{
    return(pred)
  }
}

xgb_predictions <- make_predictions(xgb_tune2) # This was used as the final model
glm_predictions <- make_predictions(GLM_best)

#write predictions to file
make_predictions(xgb_tune2, write = T)

8.8 Variable Importance

importance_df <- data_frame(
  Feature = rownames(varImp(xgb_tune2)$importance),
  Importance = varImp(xgb_tune2)$importance$Overall
) %>% 
  arrange(Importance)

importance_order <- importance_df$Feature

importance_df %>% 
  mutate(Feature = fct_relevel(Feature, importance_order)) %>% 
  ggplot(aes(Feature, Importance)) + 
  geom_bar(stat = "identity", fill = "dodgerblue") + 
  coord_flip()

varImp(xgb_tune2)
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 32)
## 
##     Overall
## V2  100.000
## V1   31.485
## V7   31.155
## V20  26.315
## V5   22.344
## V6    7.968
## V3    6.595
## V4    5.478
## V23   3.567
## V27   2.876
## V8    2.853
## V21   2.641
## V26   2.360
## V22   2.333
## V28   2.152
## V16   2.101
## V15   1.981
## V9    1.778
## V10   1.715
## V18   1.685
top_5_corr_with_amount
## [1] "V2"  "V5"  "V6"  "V20" "V16"

We can look at the partial dependence plots.

make_gg_partial <- function(input_feature){
  sample_index <- base::sample(x = nrow(train_x), size = 50000)
  partial_dep_data <- pdp::partial(xgb_tune2, pred.var = c(input_feature), train = train_x[sample_index, ])
  partial_dep_data %>% 
    ggplot(aes_string(input_feature, "yhat")) + 
    geom_line() + 
    geom_point(colour = "dodgerblue")
}

V1_partial <- make_gg_partial("V1") + xlim(-4, 2.5)
V2_partial <- make_gg_partial("V2") + xlim(-4, 5)
V7_partial <- make_gg_partial("V7") + xlim(-4, 5)
V20_partial <- make_gg_partial("V20") + xlim(-4, 5)
V5_partial <- make_gg_partial("V5") + xlim(-4, 5)

ggarrange(V1_partial, V2_partial, V7_partial, V20_partial, V5_partial)
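Under the hood, pdp::partial fixes the feature of interest at each grid value, predicts for every row of the data, and averages. A minimal base-R sketch, where predict_fn and df are hypothetical stand-ins for the fitted model and training data:

```r
# Average model predictions with one feature pinned at each grid value
partial_dependence <- function(predict_fn, df, feature, grid) {
  sapply(grid, function(v) {
    df_fixed <- df
    df_fixed[[feature]] <- v     # pin the feature at this grid value
    mean(predict_fn(df_fixed))   # average prediction over all rows
  })
}

# For an additive model the PDP recovers a feature's own effect (up to
# a constant): yhat = 2*x1 + x2 gives a PDP with slope 2 in x1.
df <- data.frame(x1 = runif(100), x2 = runif(100))
f <- function(d) 2 * d$x1 + d$x2
pd <- partial_dependence(f, df, "x1", grid = c(0, 1))
pd[2] - pd[1]  # approximately 2
```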

9 Sources

Reference Texts

R Software Packages

Online Articles (More than can be listed)

Generalized Linear Models for Insurance Ratemaking
A Gentle Introduction to XGBoost for Applied Machine Learning
An End-to-End Guide to Understanding XGBoost

Various Kaggle repositories